Faster Construction of Optimal Binary Split Trees

Authors

  • James H. Hester
  • Daniel S. Hirschberg
  • S.-H. S. Huang
  • Chak-Kuen Wong
Abstract

A binary split tree is a search structure combining features of heaps and binary search trees. Building an optimal binary split tree was originally conjectured to be intractable due to difficulties in applying dynamic programming techniques to the problem. However, two algorithms have recently been published which generate optimal trees in O(n^5) time for records with distinct access probabilities. An extension allowing non-distinct access probabilities required exponential time. These algorithms consider a range of values when only a single value is possible. A dynamic programming method for determining the correct value is given, resulting in an algorithm which builds an optimal binary split tree in O(n^5) time for non-distinct access probabilities and Θ(n^4) time for distinct access probabilities.

Introduction

A binary split tree (BST) is a structure for storing records on which searches will be performed, assuming that the probabilities of access are known in advance. For every subtree T in a BST, the record with the highest access probability of all records in T is stored in the root of T. The remaining records are distributed among the left and right subtrees of T such that the keys of all records in the left subtree are less than the keys of all records in the right subtree. Each node in a BST contains the key value of the record in that node and a split value which lexically divides the values of the keys in the left and right subtrees. A simple split value is the value of the largest key in the left subtree.

Under the assumption of distinct access probabilities and no failed searches, for any given set of n records, the key to be put in the root is predetermined, but the split value for the root may be chosen to divide the remaining n-1 records between the left and right subtrees in any of n possible ways. If failed searches are considered, the split value may be any of n+2 possibilities. For optimal BSTs, the number of possible divisions is n-2 if failed searches are not considered and n if failed searches are considered. This is due to the easily proven fact that, if there are two or more non-zero probabilities (access probabilities or failure probabilities) in any optimal subtree X, then at least one non-zero probability must be in each of the two subtrees of X.
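To make the node structure concrete, the following sketch (illustrative only, not taken from the paper; the Node fields simply mirror the description above) shows how a search proceeds in a binary split tree, using one equality test against the node's key and one comparison against its split value at each node:

    class Node:
        # key:   key of the record stored at this node (the record with the
        #        highest access probability in this subtree)
        # split: split value, e.g. the largest key in the left subtree
        def __init__(self, key, split, left=None, right=None):
            self.key, self.split = key, split
            self.left, self.right = left, right

    def search(root, key):
        node = root
        while node is not None:
            if key == node.key:
                return node              # successful search
            node = node.left if key <= node.split else node.right
        return None                      # failed search (the key falls into a gap)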
Binary split trees were introduced by Sheil [5], who conjectured that the arbitrary removal of nodes with high access probabilities from the lexicographic ordering (for placement in roots of higher subtrees) made the normal dynamic programming techniques inapplicable. However, Huang and Wong [1] noted that the keys missing from any given range must be the keys with the largest access probabilities in that range of keys, thus allowing a representation of the set of keys in a subtree by specifying a range of keys and a count of the number of keys missing from that range. This led to a Θ(n^5) time and Θ(n^3) space dynamic programming algorithm to construct optimal BSTs in a manner similar to Knuth's algorithm [3] for constructing optimal binary search trees. Shortly after Huang and Wong's first paper, Perl [4] presented an independently derived algorithm similar to Huang and Wong's which had the same time and space bounds, but which also took into account probabilities of failed searches. Perl showed that the technique used by Knuth to reduce the asymptotic time complexity of his optimal binary tree constructor by a factor of n could not be applied to the construction of BSTs. Perl's paper also presented an algorithm that allowed non-distinct access probabilities by including some top-down decision making, which resulted in an exponential time algorithm.

Unfortunately, the algorithms (for distinct access probabilities) presented by Huang and Wong and by Perl pick the minimum weight of subtrees resulting from considering values of a variable over a range, when only one value could be correct. This adds a factor of O(n) time to the algorithm, and raises the question of whether the algorithms might proffer a minimum-cost value of this variable which is not attainable. If this is so, these errors may be made independently for every subtree, and the algorithms may result in a structure that is not a valid split tree. Some quick thought on the problem and a little testing gave us reason to believe that the cost corresponding to the correct value of the variable will always be better than the costs corresponding to all of the incorrect values, and thus the correct value will always be picked. We have not proven this, since the results of this paper render the question academic.

Huang and Wong [2] proposed a more generalized split tree, which relaxes the constraint that the record with the highest access probability must be in the root. They presented an algorithm for constructing trees of this nature, which works even in the case of non-distinct access probabilities, but requires Θ(n^5) time.

This paper is the result of a merger of two parallel collaborations. Huang and Wong modified their earlier algorithm to remove the superfluous loop and discussed extending their new algorithm to handle failed accesses in a manner similar to the method given by Perl [4]. Independently, Hester and Hirschberg pointed out the possible error of constructing an incorrect tree and developed an analogous algorithm, but directly incorporated Perl's method of handling failed accesses. They also extended their algorithm to the problem of constructing an optimal binary split tree in polynomial time in the presence of non-distinct access probabilities. We thus present an algorithm which calculates the value to be used (without the extra loop), resulting in construction of an optimal BST for records with distinct access probabilities in Θ(n^4) time, while still using only Θ(n^3) space. This algorithm also constructs an optimal BST for records with non-distinct access probabilities in O(n^5) time, by saving some extra information (still Θ(n^3)) to allow postponing top-down decisions until sufficient constraints are accumulated to eliminate the exponential cost of these decisions.

Definitions and Data Structures

We are given n records indexed from 1 to n. Each record r_i has a key Key(i) such that Key(i) < Key(j) for all i < j. If the records are not so ordered, we can prepend a sort to the algorithm without adversely affecting its asymptotic costs. Each record r_i also has an access probability p(i). In addition, to account for failed searches, we are given failure probabilities q(i) for 0 ≤ i ≤ n, which are the probabilities of searching for a key K such that Key(i) < K < Key(i+1) (where Key(0) and Key(n+1) are taken to be -∞ and +∞, respectively).

A subtree of a BST is represented, as described in the Introduction, by a triple ⟨i, j, k⟩: it spans the records with indices i+1 through j, except for the k records with the largest access probabilities in that range (which appear as roots of higher subtrees), together with the failure probabilities q(i) through q(j-1). For each such triple we record R[i, j, k], the index of the root of the subtree spanning ⟨i, j, k⟩ (the record of largest access probability among those present); SP[i, j, k], the index giving the optimal split of that subtree; and kL[i, j, k], the number of records missing from the left subtree of that optimal subtree. For a candidate split index l, let GTL(i, j, k, l) be the number of records with indices in the range i+1 to l whose access probabilities are greater than p(R[i, j, k]), and let EQL(i, j, k, l) be the number whose access probabilities are equal to p(R[i, j, k]). GTL satisfies the recurrence

    GTL(i, j, k, l) =
        0                          if l = i
        GTL(i, j, k, l-1)          if l ≠ i and p(l) ≤ p(R[i, j, k])
        GTL(i, j, k, l-1) + 1      if l ≠ i and p(l) > p(R[i, j, k])

Substituting 'EQ' for 'GT', '≠' for '≤' and '=' for '>' in this recurrence gives the recurrence for EQL(i, j, k, l).
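The recurrence is evaluated incrementally as l grows; in the algorithm given later, the counts are simply kept in scalar variables that are updated once per new value of l. As a standalone illustration (not the paper's code; p is indexed 1..n with p[0] unused, and root_prob stands for p(R[i,j,k])):

    def gtl_eql(p, i, l, root_prob):
        # GTL(i,j,k,l): how many records with indices i+1..l have access
        # probability greater than that of the root of <i,j,k>;
        # EQL(i,j,k,l): how many have access probability equal to it.
        gtl = eql = 0
        for m in range(i + 1, l + 1):    # unrolls the recurrence from l = i upward
            if p[m] > root_prob:
                gtl += 1
            elif p[m] == root_prob:
                eql += 1
        return gtl, eql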
A similar recurrence could also be constructed for GEL(i, j, k, l) = GTL(i, j, k, l) + EQL(i, j, k, l), which is simpler and sufficient for distinct access probabilities, but the two separate values are necessary in the next section, when we relax the constraint on access probabilities. Since only the previous values (in terms of l) are needed at any given time, our algorithm just uses the scalar variables GTL and EQL within the loops which consider possible values for i, j, k and l, but the more verbose functional definition of GEL(i, j, k, l) is useful for clarity in the following definitions.

W[i, j, k] is the weight of a subtree spanning ⟨i, j, k⟩, which is defined as

    W[i, j, k] = Σ_{p(l) ∈ ⟨i,j,k⟩} p(l) + Σ_{l=i..j-1} q(l)

COT[i, j, k] is the cost of an optimal subtree spanning ⟨i, j, k⟩, which is defined as

    COT[i, j, k] = W[i, j, k] + min_{i < l < j} { COT[i, l, GEL(i, j, k, l)] + COT[l, j, k+1 - GEL(i, j, k, l)] }
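As a concrete reading of these definitions, the sketch below evaluates W[i,j,k] directly from the representation described in the Introduction (a range of keys plus a count of missing keys, which must be the keys of largest access probability in the range). It is only an illustration, and it assumes the indexing convention used by the initialization procedures later in the paper: ⟨i,j,k⟩ spans records i+1 through j (a non-existent record n+1 contributes nothing) together with the failure probabilities q(i) through q(j-1), and access probabilities are assumed distinct so that "the k largest" is unambiguous.

    def weight(p, q, i, j, k):
        # p[1..n]: access probabilities (p[0] unused); q[0..n]: failure probabilities.
        n = len(p) - 1
        in_range = sorted(range(i + 1, min(j, n) + 1), key=lambda l: p[l])
        present = in_range[:max(0, len(in_range) - k)]   # drop the k largest
        return sum(p[l] for l in present) + sum(q[l] for l in range(i, j))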
Non-Distinct Access Probabilities

The definitions in the previous section permit design of a Θ(n^4) time and Θ(n^3) space dynamic programming algorithm for constructing optimal BSTs similar to Knuth's algorithm [3] for constructing optimal binary search trees. We now present extensions of these definitions which lead to an algorithm that allows non-distinct access probabilities with the same space complexity and requires at most an extra factor of O(n) time. Thus the algorithm requires O(n^5) time, but this is only an upper bound based on large numbers of equal access probabilities. When access probabilities are distinct, this algorithm requires only Θ(n^4) time.

The major problem when there may be non-distinct access probabilities is that, during the calculation of COT, SP and kL, the root of a given subtree may be unknown, since it could be any one of a set of non-missing records with maximal access probabilities. It may even be unknown which records are in this set, i.e., which of the records with access probabilities equal to the root are not missing. The previous algorithm [4] shifted to a top-down approach at this point, which resulted in an exponential time complexity. We note that the only pieces of information needed during the calculation of COT, SP and kL are the weights of the subtrees, and that these weights are not dependent on which one of the records with equal access probabilities is the root. The only problem is in predicting from which subtree the root that eventually will be picked is missing. This is not fixed, so we check all possible distributions of potential roots between the subtrees without ever committing to exactly which record is the root of the current subtree. The final decisions will be made in a top-down fashion when the tree is constructed, at which time only the optimal subtrees are considered, and thus the exponential work is avoided.

For the following definitions required by our algorithm, OBST, let T be any tree spanning ⟨i, j, k⟩ and let P be the access probability of any key that might be the root of T. When we compare records, saying one is greater, less, etc. than another, we are referring to the access probabilities of those records. This also applies when we compare a record to P. We refine the definition of R[i, j, k] to be the index of the rightmost possible root of T. This gains only a minor savings in time, but calculation of the following arrays supplies this information at no additional cost. Let EQ[i, j, k] be the number of records with indices in the range i to j which are equal to P (recall that P is determined, in part, by k). Similarly, let LT[i, j, k] be the number of records which are less than P. We define the function

    GT(i, j, k) = j - i - (LT[i, j, k] + EQ[i, j, k])

as the number of records in the range of T which are greater than P. Also, we define

    EQm(i, j, k) = k - GT(i, j, k)

as the number of records with indices in the range i to j which are equal to P and missing from T (note that k includes all records which are greater than P and possibly some that are equal to P). Since GT(i, j, k) and EQm(i, j, k) can be calculated from other known values, they are not stored by the algorithm, but are used for notational convenience.

Note that, although the value of GEL(i, j, k, l) = GTL + EQL was equal to the number of records missing from the left subtree when only distinct access probabilities were considered, this is not true in OBST, since some unknown number of the records counted in EQL may not be missing. We define EQmL to be the number of records counted in EQL which are missing from the left subtree of T. The number of missing records in the left subtree thus is represented in OBST by the value of GTL + EQmL, which eventually will be stored in kL[i, j, k] after the optimal value of EQmL is found. There may be many possible values of EQmL, which correspond to decisions about whether the roots of T and subtrees of T are chosen from the left or right of their respective subtrees. When looking for optimal splits, we bound the possible values of EQmL and then check all values within our bounds. This does not determine a root for T, but provides constraints which are used during the final top-down construction of the tree to ensure that the root picked is consistent with the remainder of the tree.

Define EQR = EQ[i, j, k] - EQL and EQmR = EQm(i, j, k) + 1 - EQmL. Since these values can be calculated from other known values, they are not stored by our algorithm, but are calculated as needed. They are defined here for clarity in the following bounds.

Bounds on EQmL:

    (1)  (number of keys missing from left subtree) ≤ (number of keys missing from both subtrees)
    (2)  GTL + EQmL ≤ k + 1                          rewriting (1)
    (3)  EQmL ≤ EQL
    (*)  EQmL ≤ min{EQL, k + 1 - GTL}                from (2) and (3)
    (4)  EQR ≥ EQmR
    (5)  EQR ≥ EQm(i, j, k) + 1 - EQmL               rewriting (4)
    (6)  EQmL ≥ 0
    (**) EQmL ≥ max{0, EQm(i, j, k) + 1 - EQR}       from (5) and (6)

Thus, for any l splitting T (and the values of EQL and GTL corresponding to that split), (*) and (**) give us

    max{0, EQm(i, j, k) + 1 - EQR}  ≤  EQmL  ≤  min{EQL, k + 1 - GTL}

Note that, any time there is only one possible root, these bounds restrict EQmL to either 1 or 0, depending on whether the single possible root of T is in the left or right subtree of T. Thus, the extra factor of n on the time of this algorithm is only an upper bound; the algorithm is o(n^5), approaching Θ(n^4), when there are few records with equal access probabilities, and is Θ(n^4) when records have distinct access probabilities.
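In code, bounds (*) and (**) simply delimit the candidate values of EQmL that must be tried for a particular split l; the inner loop of Find_Min in the next section does exactly this. A sketch (illustrative only; argument names follow the definitions above):

    def eqml_candidates(EQ_ijk, EQm_ijk, k, GTL, EQL):
        EQR = EQ_ijk - EQL                  # records equal to P to the right of the split
        lo = max(0, EQm_ijk + 1 - EQR)      # bound (**)
        hi = min(EQL, k + 1 - GTL)          # bound (*)
        return range(lo, hi + 1)            # every admissible value of EQmL

For each candidate EQmL, the number of records missing from the left subtree is GTL + EQmL, so the recurrence is evaluated as COT[i, l, GTL + EQmL] + COT[l, j, k+1 - (GTL + EQmL)].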
The calculation of W[i, j, k] is complicated by the fact that, when a record with index j such that p(j) = P is being considered by the dynamic programming processes, it is sometimes unclear whether record j is present or missing from the subtree. It is simple enough when EQm(i, j, k) = 0, since record j must be present if p(j) = P and no records equal to P are missing from the subtree. When EQm(i, j, k) > 0, we do not know which of the records that are equal to P are missing, but we do know their weight and how many of them there are. Thus we avoid making any decision about whether record j is missing by subtracting p(j)·EQm(i, j, k) (the total weight of the records which are equal to P and missing from the subtree) from W[i, j, k - EQm(i, j, k)] (the weight of the subtree with none of the records equal to P missing).

We construct the tree in a natural top-down fashion based on the values of R, SP, kL and Key as before, but the choice of the root for each subtree is made in postorder, after the subtrees below it have been fully constructed. A global array of flags is used to indicate which records have been allocated as roots so far, and the choice of the root for a subtree is restricted to any record in the range of the subtree which has the correct access probability and has not already been allocated as a root of some lower subtree. We search backwards from the rightmost possible root in the range, which may save a little time, but still yields an O(n) search for each root, making the time required to construct the tree (after the arrays have been set up) O(n^2). Thus the total time for the algorithm is dominated by the O(n^5) time required to calculate COT, SP, and kL.
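The postorder root selection can be pictured as follows (a sketch only; the procedure Build_Tree given in the next section performs the same backward scan as part of the recursive construction):

    def pick_root(p, R_ijk, flag):
        # Scan backwards from the rightmost possible root R[i,j,k] to the first
        # record with the same access probability not yet allocated as a root.
        x = R_ijk
        while p[x] != p[R_ijk] or flag[x] != 'free':
            x -= 1
        flag[x] = 'used'
        return x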
The Algorithm OBST

We now present the algorithm OBST for calculating an optimal BST when there may be non-distinct access probabilities. The output of OBST is the variable Tree, which points to the root of an optimal BST for the given input. The input value n and input functions Key, p and q are global to all procedures. The internal arrays R, W, COT, SP, kL, EQ, LT and FLAG are also global to all procedures.

OBST(n, Key, p, q, Tree):  /* calculate optimal BST for non-distinct access probabilities */
begin
    InitR()
    InitW()
    Compute()
    for i ← 1 until n do FLAG[i] ← 'free'
    Tree ← Build_Tree(0, n+1, 0)
end

InitR():  /* initialize R, EQ, and LT */
for i ← 0 until n-1 do begin
    R[i, i+1, 0] ← i+1 ,  R[i, i+1, 1] ← 0
    EQ[i, i+1, 0] ← 1 ,   EQ[i, i+1, 1] ← 0
    LT[i, i+1, 0] ← 0 ,   LT[i, i+1, 1] ← 0
    for j ← i+2 until n+1 do begin
        if p(j) > p(R[i, j-1, 0]) then begin  /* new root */
            R[i, j, 0] ← j
            EQ[i, j, 0] ← 1
        end
        else if p(j) = p(R[i, j-1, 0]) then begin  /* rightmost root */
            R[i, j, 0] ← j
            EQ[i, j, 0] ← EQ[i, j-1, 0] + 1
        end
        else begin  /* less than root */
            R[i, j, 0] ← R[i, j-1, 0]
            EQ[i, j, 0] ← EQ[i, j-1, 0]
        end
        LT[i, j, 0] ← j - i - EQ[i, j, 0]
        for k ← 1 until j-i-1 do CheckR(i, j, k)
    end
end

CheckR(i, j, k):  /* check general conditions for R, EQ, and LT */
if p(j) > p(R[i, j-1, k-1]) then begin  /* missing */
    R[i, j, k] ← R[i, j-1, k-1]
    EQ[i, j, k] ← EQ[i, j-1, k-1]
    LT[i, j, k] ← LT[i, j-1, k-1]
end
else if p(j) > p(R[i, j-1, k]) then begin  /* new root */
    R[i, j, k] ← j
    EQ[i, j, k] ← 1
    LT[i, j, k] ← LT[i, j-1, k] + EQ[i, j-1, k]
end
else if p(j) = p(R[i, j-1, k]) then begin  /* rightmost root */
    R[i, j, k] ← j
    EQ[i, j, k] ← EQ[i, j-1, k] + 1
    LT[i, j, k] ← LT[i, j-1, k]
end
else begin  /* less than root */
    R[i, j, k] ← R[i, j-1, k]
    EQ[i, j, k] ← EQ[i, j-1, k]
    LT[i, j, k] ← LT[i, j-1, k] + 1
end

InitW():  /* initialize W */
begin
    W[n, n+1, 0] ← q(n)
    W[n, n+1, 1] ← q(n)
    for i ← 0 until n-1 do begin
        W[i, i+1, 0] ← q(i) + p(i+1)
        W[i, i+1, 1] ← q(i)
        for j ← i+2 until n+1 do begin
            W[i, j, 0] ← W[i, j-1, 0] + q(j-1) + p(j)
            for k ← 1 until j-i do
                if p(j) > p(R[i, j-1, k-1]) then  /* missing */
                    W[i, j, k] ← W[i, j-1, k-1] + q(j-1)
                else if p(j) < p(R[i, j-1, k]) or EQm(i, j, k) = 0 then
                    /* less than root, or no records equal to root missing */
                    W[i, j, k] ← W[i, j-1, k] + q(j-1) + p(j)
                else  /* equal to root and maybe missing */
                    W[i, j, k] ← W[i, j, k - EQm(i, j, k)] - p(j)·EQm(i, j, k)
        end
    end
end

Compute():  /* initialize COT, SP, and kL */
begin
    for i ← 0 until n do begin
        COT[i, i+1, 0] ← W[i, i+1, 0]
        COT[i, i+1, 1] ← W[i, i+1, 1]
    end
    for d ← 2 until n+1 do
        for i ← 0 until n+1-d do begin
            j ← i + d
            for k ← 0 until d-1 do Find_Min(i, j, k)
            COT[i, j, d] ← W[i, j, d]
        end
end

Find_Min(i, j, k):  /* find optimal COT, SP, and kL given ⟨i, j, k⟩ */
begin
    GTL ← 0
    EQL ← 0
    minc ← ∞
    for l ← i+1 until j-1 do begin
        if p(l) > p(R[i, j, k]) then GTL ← GTL + 1
        if p(l) = p(R[i, j, k]) then EQL ← EQL + 1
        for EQmL ← max{0, EQm(i, j, k) + 1 - EQR}   /* recall that EQR = EQ[i, j, k] - EQL */
                 until min{EQL, k + 1 - GTL} do begin
            try ← COT[i, l, GTL + EQmL] + COT[l, j, k+1 - (GTL + EQmL)]
            if try < minc then begin
                minc ← try
                minl ← l
                mink ← GTL + EQmL
            end
        end
    end
    COT[i, j, k] ← minc + W[i, j, k]
    SP[i, j, k] ← minl
    kL[i, j, k] ← mink
end

Build_Tree(i, j, k):  /* return pointer to root of optimal subtree spanning ⟨i, j, k⟩ */
begin
    if i = n or k = j - i or R[i, j, k] = 0 or (j = n+1 and k = j - i - 1) then
        node ← null pointer
    else begin
        node ← pointer to a new tree node
        node.SPLIT ← Key(SP[i, j, k])
        if i ≥ j - 1 then begin
            node.LEFT ← NULL
            node.RIGHT ← NULL
        end
        else begin
            node.LEFT ← Build_Tree(i, SP[i, j, k], kL[i, j, k])
            node.RIGHT ← Build_Tree(SP[i, j, k], j, k+1 - kL[i, j, k])
        end
        x ← R[i, j, k]
        while p(x) ≠ p(R[i, j, k]) or FLAG[x] ≠ 'free' do
            x ← x - 1
        FLAG[x] ← 'used'
        node.KEY ← Key(x)
    end
    return node
end

Conclusions and Open Questions

An algorithm has been presented for constructing optimal binary split trees in Θ(n^4) time when access probabilities are distinct, and O(n^5) time when access probabilities are non-distinct. Taking into account the added complexity of choosing split values, and assuming the necessity of an extra O(n) time to allow non-distinct access probabilities, the efficiency of this algorithm is comparable to that of Knuth's O(n^3) algorithm for finding an optimal binary search tree. Since Perl [4] showed that the technique used by Knuth [3] to obtain an O(n) speedup for optimal binary search trees, reducing the time to O(n^2), cannot be applied to optimal BSTs, an open question arises as to whether or not there is some other technique (perhaps similar to Knuth's) that can be applied to BSTs to reduce the time of the algorithm presented here. One technique worth consideration is the Quadrangle Inequality presented by Yao [6].

Another open question is that of the relative value of the algorithm presented here compared with that of the Θ(n^5) algorithm presented by Huang and Wong [2] for generalized split trees (which handles equi-probable keys). They show by simulation that optimum generalized split trees can have faster expected search times than optimum split trees, but that the difference appears never to be great. However, there is no theoretic evidence to that effect.

References

1. S.-H. S. Huang and C. K. Wong, Optimal binary split trees, J. Algorithms 5 (1984) 69-79.
2. S.-H. S. Huang and C. K. Wong, Generalized binary split trees, Acta Informatica 21 (1984) 113-123.
3. D. E. Knuth, "The Art of Computer Programming," Vol. 3, "Sorting and Searching," pp. 433-439, Addison-Wesley, Reading, Mass., 1973.
4. Y. Perl, Optimum split trees, J. Algorithms 5 (1984) 367-374.
5. B. A. Sheil, Median split trees: A fast lookup technique for frequently occurring keys, Comm. ACM 21 (1978) 947-958.
6. F. F. Yao, Efficient dynamic programming using quadrangle inequalities, Proc. 12th Annual ACM Symposium on Theory of Computing, Los Angeles, Calif. (April 28-30, 1980) 429-435.

Similar Articles

Construction of Optimal Binary Split Trees in the Presence of Bounded Access Probabilities

A binary split tree is a search structure combining features of heaps and binary search trees. The fastest known algorithm for building an optimal binary split tree requires Θ(n^4) time if the keys are distinct and O(n^5) time if the keys are non-distinct. Θ(n^3) space is required in both cases. A modification is introduced which reduces a factor of n^2 in the asymptotic time to a factor of n lg(n...

Optimal Binary Search Trees with Near Minimal Height

Suppose we have n keys, n access probabilities for the keys, and n+1 access probabilities for the gaps between the keys. Let hmin(n) be the minimal height of a binary search tree for n keys. We consider the problem to construct an optimal binary search tree with near minimal height, i.e., with height h ≤ hmin(n) + ∆ for some fixed ∆. It is shown that for any fixed ∆ optimal binary search trees wi...

Classification Trees With Unbiased Multiway Splits

Two univariate split methods and one linear combination split method are proposed for the construction of classification trees with multiway splits. Examples are given where the trees are more compact and hence easier to interpret than binary trees. A major strength of the univariate split methods is that they have negligible bias in variable selection, both when the variables differ in the num...

On a Sublinear Time Parallel Construction of Optimal Binary Search Trees

We design an efficient sublinear time parallel construction of optimal binary search trees. The efficiency of the parallel algorithm corresponds to its total work (the product time × processors). Our algorithm works in O(n^(1-ε) log n) time with total work O(n^(2+2ε)), for an arbitrarily small constant 0 < ε ≤ 1/2. This is optimal within a factor n^(2ε) with respect to the best known sequential algorithm give...

Optimal Region for Binary Search Tree, Rotation and Polytope

Given a set of keys and its weight, a binary search tree (BST) with the smallest path length among all trees with the keys and the weight is called an optimal tree. Knuth showed that the optimal tree can be computed in time proportional to the square of the number of keys. In this paper, we propose algorithms that divide the weight space into regions corresponding to optimal trees by a construction algorithm of con...

Split Selection Methods for Classification Trees

Classification trees based on exhaustive search algorithms tend to be biased towards selecting variables that afford more splits. As a result, such trees should be interpreted with caution. This article presents an algorithm called QUEST that has negligible bias. Its split selection strategy shares similarities with the FACT method, but it yields binary splits and the final tree can be selected...


Journal:
  • J. Algorithms

Volume 7, Issue

Pages -

Publication date: 1986